Like other machine learning algorithms, deep learning aims to find a model that can best describe the knowledge behind the data. However, deep learning models, i.e., deep neural networks, are much more complicated than models established using other machine learning algorithms. As a result, the search for the optimal model requires an iterative optimization process to minimize the loss function. A method or algorithm that dictates this search process is called a solver or an optimizer. Solvers are an essential part of deep learning practice: we usually need to specify which solver to use and determine its hyperparameters. Common solvers include SGD, Momentum, NAG (Nesterov Accelerated Gradient), AdaGrad, AdaDelta, RMSprop, Adam, AdaMax, and Nadam.
SGD
In a broad sense, SGD has three types: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. In many cases, "SGD" refers specifically to mini-batch gradient descent. The essence of all three types is the following gradient descent update.
$$g_t = \nabla_w f(w_t), \quad w_{t+1} = w_t - \eta \cdot g_t$$

where $g_t$ is the gradient, $w$ is the model parameter to be updated, the subscripts $t$ and $t+1$ denote the current and following steps, respectively, $\eta$ is the learning rate, and $f$ is the objective/loss function.
Batch gradient descent uses all the samples, i.e., the whole training set as the batch, to calculate the gradient and then uses this gradient to update the model parameters. If the loss function is convex, we can expect to find the global optimal (minimal) value, i.e., the global minimal loss and the corresponding model parameters, provided the learning rate is small enough. However, the loss function of a deep learning problem is, in general, very complicated and usually not convex. Thus, identifying the global minimum is usually infeasible. Notwithstanding, a local minimum can still yield a decent model. Batch gradient descent can guarantee convergence to a local optimum and is relatively easy to implement. However, the method poses a high demand on memory because all the data needs to be processed together, and each gradient computation consequently takes a long time.
Stochastic gradient descent takes the opposite approach to batch gradient descent: it calculates the gradient and updates the model after every single sample is processed. Thus, it essentially uses the gradient determined by one sample, which can be viewed as the "slope" of a small area in the objective function, to decide in which direction to move for minimizing the loss. Because different samples may suggest very different or even opposite directions, it can take much more effort to find the best optimization direction. For the same reason, stochastic gradient descent is less likely to be trapped at a local optimum than batch gradient descent. Meanwhile, processing one sample at a time cannot exploit array (vectorized) operations to improve computational efficiency.
Mini-batch gradient descent is proposed to reach a compromise between batch gradient descent and stochastic gradient descent. It uses a subset of the training data as a mini-batch to calculate the gradient and update the model parameters. This method is adopted by many deep learning packages as the default SGD or even the default solver.
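To make the three variants concrete, here is a minimal Python/NumPy sketch of mini-batch gradient descent on a toy least-squares problem (the data, names, and hyperparameter values are illustrative assumptions, not from the text). Setting `batch_size` to the full dataset size recovers batch gradient descent, and setting it to 1 recovers stochastic gradient descent.

```python
import numpy as np

# Toy data: fit y = X @ w_true with the squared-error loss
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def gradient(w, Xb, yb):
    # Gradient of the mean squared-error loss over one mini-batch
    return Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
eta = 0.1            # learning rate
batch_size = 10      # len(y) -> batch GD, 1 -> stochastic GD
for epoch in range(50):
    idx = rng.permutation(len(y))            # reshuffle each epoch
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        g = gradient(w, X[batch], y[batch])
        w -= eta * g                         # w_{t+1} = w_t - eta * g_t
print(w)  # should approach w_true
```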
The selection of the learning rate in SGD is very difficult. In many cases, this can only be done via experience and/or trial and error. Please note that the difficulty of training deep neural networks is not caused by local minima alone. The existence of saddle points can also trap the solver. A saddle point has a zero gradient but is not a local optimum: the objective function curves up along some directions and down along others. Because the gradient vanishes there, the solver may behave as if an optimum has been reached.
Figure 9.11: Gradient descent method with manual learning rate
Momentum
As shown in Fig. 9.11, SGD relies only on the gradient of the current batch. Therefore, in the early stage of training, the updates can be unstable, while in the late stage, the solver may not have enough power to jump out of local optima or saddle points. The Momentum method incorporates the gradient of the previous update step in addition to the gradient of the current step. In this way, the optimization direction of the previous update enters the current step as a "momentum".
Mathematically, the update can be written as

$$\Delta w_t = \rho \cdot \Delta w_{t-1} - \eta \cdot g_t, \quad w_{t+1} = w_t + \Delta w_t$$

where $\rho$ is the momentum coefficient. The gradient can be large in the early stage, so $\rho$ typically adopts a relatively small value such as 0.5. As the gradient decreases in the later training stage, $\rho$ can be increased to values such as 0.9.
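As a concrete illustration, below is a minimal Python sketch of one Momentum update following the equation above; the toy loss $f(w) = 0.5w^2$ and all variable names are assumptions for demonstration, not from the text.

```python
def momentum_step(w, delta_prev, g, eta=0.1, rho=0.5):
    """One Momentum update: delta_w = rho * delta_prev - eta * g,
    then w <- w + delta_w. The previous step acts as 'momentum'."""
    delta_w = rho * delta_prev - eta * g
    return w + delta_w, delta_w

# Toy usage on f(w) = 0.5 * w**2, whose gradient is g = w
w, delta = 5.0, 0.0
for _ in range(100):
    w, delta = momentum_step(w, delta, g=w)
print(w)  # approaches the minimum at w = 0
```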
Nesterov
The Nesterov method, or Nesterov Momentum, can be viewed as a modification of the Momentum method. Compared with Momentum, Nesterov uses the gradient at a "future" point for the update. For this purpose, we first move from the current point by the momentum to reach this future point at $w_t + \rho \cdot \Delta w_{t-1}$ (denoted as Step $t+\frac{1}{2}$). Next, we calculate the gradient at this intermediate point, $g_{t+\frac{1}{2}}$. The momentum and the gradient are then combined into $\Delta w_t$ for the update. Mathematically, this can be written as

$$w_{t+\frac{1}{2}} = w_t + \rho \cdot \Delta w_{t-1}, \quad g_{t+\frac{1}{2}} = \nabla_w f\left(w_{t+\frac{1}{2}}\right)$$

$$\Delta w_t = \rho \cdot \Delta w_{t-1} - \eta \cdot g_{t+\frac{1}{2}}, \quad w_{t+1} = w_t + \Delta w_t$$
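The following minimal Python sketch shows the look-ahead step under the same toy setup as before (the loss and names are illustrative assumptions):

```python
def nesterov_step(w, delta_prev, grad_fn, eta=0.1, rho=0.5):
    """One Nesterov update: evaluate the gradient at the look-ahead
    point w + rho * delta_prev rather than at w itself."""
    lookahead = w + rho * delta_prev      # w_{t+1/2}
    g = grad_fn(lookahead)                # g_{t+1/2}
    delta_w = rho * delta_prev - eta * g
    return w + delta_w, delta_w

# Toy usage on f(w) = 0.5 * w**2, so grad_fn returns w itself
w, delta = 5.0, 0.0
for _ in range(100):
    w, delta = nesterov_step(w, delta, grad_fn=lambda v: v)
print(w)  # approaches the minimum at w = 0
```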
The learning rate in the above gradient descent methods is critical to their performance. However, all of these methods rely on a learning rate that is adjusted manually, and this adjustment can significantly determine the learning outcome.
AdaGrad
AdaGrad is proposed to automatically adjust the learning rate. The update on the model parameters is performed as follows:

$$n_t = n_{t-1} + g_t^2, \quad \Delta w_t = -\frac{\eta}{\sqrt{n_t + \epsilon}} \cdot g_t, \quad w_{t+1} = w_t + \Delta w_t$$

where $\eta$ is the initial learning rate, and $\epsilon$ is a small number used to avoid a zero denominator.
As can be seen in the above equation, the learning rate will gradually decrease as the squared gradients accumulate in $n_t = \sum_{\tau=1}^{t} g_\tau^2$.
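A minimal NumPy sketch of this rule follows (the toy quadratic loss and names are assumptions for illustration):

```python
import numpy as np

def adagrad_step(w, n, g, eta=0.1, eps=1e-8):
    """One AdaGrad update: squared gradients accumulate in n, so the
    effective step size eta / sqrt(n + eps) shrinks monotonically."""
    n = n + g ** 2
    w = w - eta / np.sqrt(n + eps) * g
    return w, n

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is g = w
w, n = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    w, n = adagrad_step(w, n, g=w)
print(w)  # moves toward the minimum at 0, ever more slowly
```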
AdaDelta
AdaDelta is proposed to address three issues with AdaGrad: 1) the learning rate can only decrease monotonically, which can lead to excessively small learning rates in late training stages; 2) the units on the two sides of the above AdaGrad update equation are not consistent; and 3) the initial learning rate still needs to be set manually.
As for 1), AdaDelta uses an exponentially decaying average of $g_t^2$ instead of the sum: $\mathbb{E}[g^2]_t = \rho \cdot \mathbb{E}[g^2]_{t-1} + (1-\rho) \cdot g_t^2$.
A variant that stops here, known as RMSprop, still uses an initial (global) learning rate as AdaGrad does, together with the root mean square (RMS) of the gradients, $\mathrm{RMS}[g]_t = \sqrt{\mathbb{E}[g^2]_t + \epsilon}$. The update equation is as follows.

$$w_{t+1} = w_t - \frac{\eta}{\mathrm{RMS}[g]_t} \cdot g_t$$

Because the decaying average tracks recent gradients, RMSprop is suitable for unstable (non-stationary) targets; thus, it can generate relatively good results for RNNs. As for 2) and 3), the full AdaDelta method further replaces the global learning rate $\eta$ with the RMS of the previous parameter updates,

$$\Delta w_t = -\frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} \cdot g_t$$

which makes the units on the two sides consistent and removes the manually set learning rate.
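The two rules can be compared side by side in a minimal NumPy sketch (illustrative only; the hyperparameter values are common defaults, not from the text):

```python
import numpy as np

def rmsprop_step(w, Eg2, g, eta=0.05, rho=0.9, eps=1e-8):
    """RMSprop: decaying average of g^2 instead of AdaGrad's sum,
    but still a manually chosen global learning rate eta."""
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    return w - eta / np.sqrt(Eg2 + eps) * g, Eg2

def adadelta_step(w, Eg2, Edw2, g, rho=0.9, eps=1e-8):
    """AdaDelta: eta is replaced by the RMS of past updates, so the
    units match and no learning rate needs to be set."""
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dw = -np.sqrt(Edw2 + eps) / np.sqrt(Eg2 + eps) * g
    Edw2 = rho * Edw2 + (1 - rho) * dw ** 2
    return w + dw, Eg2, Edw2

# Toy usage of RMSprop on f(w) = 0.5 * w**2 (gradient g = w)
w, Eg2 = 5.0, 0.0
for _ in range(300):
    w, Eg2 = rmsprop_step(w, Eg2, g=w)
print(w)  # ends near the minimum at w = 0
```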
Adam
Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates. Adam is essentially RMSprop with momentum: it uses estimates of the first-order and second-order moments of the gradient to adapt the learning rate.
First, the first-order and second-order moment estimates of the gradient are computed and bias-corrected as follows:

$$m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t, \quad v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

Then, the model parameters are updated as

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t$$
Adam confines the effective learning rate to a certain range through the above adjustments, leading to stable variations of the learning rate. It integrates AdaGrad's advantage in handling sparse gradients and RMSprop's advantage in dealing with unstable (non-stationary) targets. Besides, Adam has a relatively low demand for memory and can set adaptive learning rates for different parameters. It is applicable to most non-convex optimization problems and is suitable for large datasets and high-dimensional data.
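These equations map directly to code. Below is a minimal NumPy sketch using the commonly cited values of $\beta_1$, $\beta_2$, and $\epsilon$; the toy loss and the learning rate in the usage loop are assumptions for demonstration.

```python
import numpy as np

def adam_step(w, m, v, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: decaying moment estimates plus bias correction.
    t is the 1-based step count used in the correction terms."""
    m = beta1 * m + (1 - beta1) * g          # first moment (mean)
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)             # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage on f(w) = 0.5 * w**2 (gradient g = w)
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    w, m, v = adam_step(w, m, v, g=w, t=t, eta=0.05)
print(w)  # approaches the minimum at w = 0
```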
Nadam
Nadam can be viewed as a variant of Adam that incorporates Nesterov momentum: the look-ahead idea of Nesterov is applied to Adam's first-moment estimate.
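For completeness, here is one common simplified form of the Nadam step (an assumption based on Dozat's formulation, not spelled out in the text); it has the same interface as `adam_step` above and can be dropped into the same loop.

```python
import numpy as np

def nadam_step(w, m, v, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Nadam update: Adam with a Nesterov-style look-ahead mixed
    into the bias-corrected first moment."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Look-ahead: combine the momentum term with the current gradient
    m_bar = beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)
    return w - eta * m_bar / (np.sqrt(v_hat) + eps), m, v
```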